{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 原文地址 [blog.csdn.net](https://blog.csdn.net/lys\\_828/article/details/106489371)\n",
    "\n",
    "### 利用 FuzzyWuzzy 库匹配字符串\n",
    "\n",
    "*   [1\\. 背景前言](#1__1)\n",
    "*   [2\\. FuzzyWuzzy 库介绍](#2_FuzzyWuzzy_7)\n",
    "*   *   [2.1 安装](#21__8)\n",
    "    *   [2.1 fuzz 模块](#21_fuzz_17)\n",
    "    *   *   [2.1.1 简单匹配（Ratio）](#211_Ratio_22)\n",
    "        *   [2.1.2 非完全匹配（Partial Ratio）](#212_Partial_Ratio_31)\n",
    "        *   [2.1.3 忽略顺序匹配（Token Sort Ratio）](#213_Token_Sort_Ratio_40)\n",
    "        *   [2.1.4 去重子集匹配（Token Set Ratio）](#214_Token_Set_Ratio_53)\n",
    "    *   [2.2 process 模块](#22_process_67)\n",
    "    *   *   [2.2.1 extract 提取多条数据](#221_extract_69)\n",
    "        *   [2.2.2 extractOne 提取一条数据](#222_extractOne_78)\n",
    "*   [3\\. 实战应用](#3__87)\n",
    "*   *   [3.1 公司名称字段模糊匹配](#31__89)\n",
    "    *   *   [3.1.1 参数讲解：](#311__94)\n",
    "        *   [3.1.2 核心代码讲解](#312__109)\n",
    "    *   [3.2 省份字段模糊匹配](#32__132)\n",
    "*   [4\\. 全部函数代码](#4__136)\n",
    "\n",
    "1\\. 背景前言\n",
    "========\n",
    "\n",
    "在处理数据的过程中，难免会遇到下面类似的场景，自己手里头获得的是简化版的数据字段，但是要比对的或者要合并的却是完整版的数据（有时候也会反过来）\n",
    "\n",
    "最常见的一个例子就是：在进行地理可视化中，自己收集的数据只保留的缩写，比如北京，广西，新疆，西藏等，但是待匹配的字段数据却是北京市，广西壮族自治区，新疆维吾尔自治区，西藏自治区等，如下。因此就需要有没有一种方式可以很快速便捷的直接进行对应字段的匹配并将结果单独生成一列，就可以用到 FuzzyWuzzy 库  \n",
    "![](https://img-blog.csdnimg.cn/20200602094756748.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2x5c184Mjg=,size_16,color_FFFFFF,t_70)\n",
    "\n",
    "2\\. FuzzyWuzzy 库介绍\n",
    "==================\n",
    "\n",
    "2.1 安装\n",
    "------\n",
    "\n",
    "这里使用的是 Anaconda 下的 jupyter notebook 编程环境，因此在 Anaconda 的命令行中输入一下指令进行第三方库安装\n",
    "\n",
    "```\n",
    "pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy\n",
    "\n",
    "```\n",
    "\n",
    "→ 输出的结果为：（如果使用本地的 python，可以直接 cmd 后安装）  \n",
    "![](https://img-blog.csdnimg.cn/20200602095258248.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2x5c184Mjg=,size_16,color_FFFFFF,t_70)\n",
    "\n",
    "2.1 fuzz 模块\n",
    "-----------\n",
    "\n",
    "该模块下主要介绍四个函数（方法），分别为：简单匹配（Ratio）、非完全匹配（Partial Ratio）、忽略顺序匹配（Token Sort Ratio）和去重子集匹配（Token Set Ratio）\n",
    "\n",
    "**注意，注意：** 如果直接导入这个模块的话，系统会提示 warning，当然这不代表报错，程序依旧可以运行（使用的默认算法，执行速度较慢），可以按照系统的提示安装 [python-Levenshtein 库](https://www.lfd.uci.edu/~gohlke/pythonlibs/#python-levenshtein)进行辅助，这有利于提高计算的速度  \n",
    "![](https://img-blog.csdnimg.cn/20200602101714978.png)\n",
    "\n",
    "### 2.1.1 简单匹配（Ratio）\n",
    "\n",
    "简单的了解一下就行，这个不怎么精确，也不常用\n",
    "\n",
    "```\n",
    "fuzz.ratio(\"河南省\", \"河南省\")\n",
    ">>> 100\n",
    ">\n",
    "fuzz.ratio(\"河南\", \"河南省\")\n",
    ">>> 80\n",
    "\n",
    "```\n",
    "\n",
    "### 2.1.2 非完全匹配（Partial Ratio）\n",
    "\n",
    "尽量使用非完全匹配，精度较高\n",
    "\n",
    "```\n",
    "fuzz.partial\\_ratio(\"河南省\", \"河南省\")\n",
    ">>> 100\n",
    "\n",
    "fuzz.partial\\_ratio(\"河南\", \"河南省\")\n",
    ">>> 100\n",
    "\n",
    "```\n",
    "\n",
    "### 2.1.3 忽略顺序匹配（Token Sort Ratio）\n",
    "\n",
    "原理在于：以 **空格** 为分隔符，**小写** 化所有字母，无视空格外的其它标点符号\n",
    "\n",
    "```\n",
    "fuzz.ratio(\"西藏 自治区\", \"自治区 西藏\")\n",
    ">>> 50\n",
    "fuzz.ratio('I love YOU','YOU LOVE I')\n",
    ">>> 30\n",
    "\n",
    "fuzz.token\\_sort\\_ratio(\"西藏 自治区\", \"自治区 西藏\") \n",
    ">>> 100\n",
    "fuzz.token\\_sort\\_ratio('I love YOU','YOU LOVE I') \n",
    ">>> 100\n",
    "\n",
    "```\n",
    "\n",
    "### 2.1.4 去重子集匹配（Token Set Ratio）\n",
    "\n",
    "相当于比对之前有一个集合去重的过程，注意最后两个，可理解为该方法是在 token\\_sort\\_ratio 方法的基础上添加了集合去重的功能，下面三个匹配的都是倒序\n",
    "\n",
    "```\n",
    "fuzz.ratio(\"西藏 西藏 自治区\", \"自治区 西藏\")\n",
    ">>> 40\n",
    "\n",
    "fuzz.token\\_sort\\_ratio(\"西藏 西藏 自治区\", \"自治区 西藏\")\n",
    ">>> 80\n",
    "\n",
    "fuzz.token\\_set\\_ratio(\"西藏 西藏 自治区\", \"自治区 西藏\")\n",
    ">>> 100\n",
    "\n",
    "```\n",
    "\n",
    "fuzz 这几个 ratio() 函数（方法）最后得到的结果都是数字，如果需要获得匹配度最高的字符串结果，还需要依旧自己的数据类型选择不同的函数，然后再进行结果提取，如果但看文本数据的匹配程度使用这种方式是可以量化的，但是对于我们要提取匹配的结果来说就不是很方便了，因此就有了 process 模块\n",
    "\n",
    "2.2 process 模块\n",
    "--------------\n",
    "\n",
    "用于处理备选答案有限的情况，返回模糊匹配的字符串和相似度。\n",
    "\n",
    "### 2.2.1 extract 提取多条数据\n",
    "\n",
    "类似于爬虫中 select，返回的是列表，其中会包含很多匹配的数据\n",
    "\n",
    "```\n",
    "choices = \\[\"河南省\", \"郑州市\", \"湖北省\", \"武汉市\"\\]\n",
    "process.extract(\"郑州\", choices, limit=2)\n",
    ">>> \\[('郑州市', 90), ('河南省', 0)\\]\n",
    "# extract之后的数据类型是列表，即使limit=1，最后还是列表，注意和下面extractOne的区别\n",
    "\n",
    "```\n",
    "\n",
    "### 2.2.2 extractOne 提取一条数据\n",
    "\n",
    "如果要提取匹配度最大的结果，可以使用 extractOne，注意这里返回的是 **元组** 类型， 还有就是匹配度最大的结果**不一定是我们想要的数据**，可以通过下面的示例和两个实战应用体会一下\n",
    "\n",
    "```\n",
    "process.extractOne(\"郑州\", choices)\n",
    ">>> ('郑州市', 90)\n",
    "\n",
    "process.extractOne(\"北京\", choices)\n",
    ">>> ('湖北省', 45)\n",
    "\n",
    "```\n",
    "\n",
    "3\\. 实战应用\n",
    "========\n",
    "\n",
    "这里举两个实战应用的小例子，第一个是公司名称字段的模糊匹配，第二个是省市字段的模糊匹配\n",
    "\n",
    "3.1 公司名称字段模糊匹配\n",
    "--------------\n",
    "\n",
    "数据及待匹配的数据样式如下：自己获取到的数据字段的名称很简洁，并不是公司的全称，因此需要进行两个字段的合并  \n",
    "![](https://img-blog.csdnimg.cn/20200602114627169.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2x5c184Mjg=,size_16,color_FFFFFF,t_70)  \n",
    "直接将代码封装为函数，主要是为了方便日后的调用，这里参数设置的比较详细，执行结果如下：  \n",
    "![](https://img-blog.csdnimg.cn/20200602115239574.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2x5c184Mjg=,size_16,color_FFFFFF,t_70)\n",
    "\n",
    "### 3.1.1 参数讲解：\n",
    "\n",
    "① 第一个参数 df\\_1 是自己获取的欲合并的左侧数据（这里是 data 变量）；\n",
    "\n",
    "② 第二个参数 df\\_2 是待匹配的欲合并的右侧数据（这里是 company 变量）；\n",
    "\n",
    "③ 第三个参数 key1 是 df\\_1 中要处理的字段名称（这里是 data 变量里的‘公司名称’字段）\n",
    "\n",
    "④ 第四个参数 key2 是 df\\_2 中要匹配的字段名称（这里是 company 变量里的‘公司名称’字段）\n",
    "\n",
    "⑤ 第五个参数 threshold 是设定提取结果匹配度的标准。注意这里就是对 extractOne 方法的完善，提取到的最大匹配度的结果并不一定是我们需要的，所以需要设定一个阈值来评判，这个值就为 90，只有是大于等于 90，这个匹配结果我们才可以接受\n",
    "\n",
    "⑥ 第六个参数，默认参数就是只返回两个匹配成功的结果\n",
    "\n",
    "⑦ 返回值：为 df\\_1 添加‘matches’字段后的新的 DataFrame 数据\n",
    "\n",
    "### 3.1.2 核心代码讲解\n",
    "\n",
    "第一部分代码如下，可以参考上面讲解 process.extract 方法，这里就是直接使用，所以返回的结果 m 就是列表中嵌套元祖的数据格式，样式为: \\[(‘郑州市’, 90), (‘河南省’, 0)\\]，因此第一次写入到’matches’字段中的数据也就是这种格式\n",
    "\n",
    "**注意，注意：** 元祖中的第一个是匹配成功的字符串，第二个就是设置的 threshold 参数比对的数字对象\n",
    "\n",
    "```\n",
    "s = df\\_2\\[key2\\].tolist()\n",
    "m = df\\_1\\[key1\\].apply(lambda x: process.extract(x, s, limit=limit))    \n",
    "df\\_1\\['matches'\\] = m\n",
    "\n",
    "```\n",
    "\n",
    "第二部分的核心代码如下，有了上面的梳理，明确了‘matches’字段中的数据类型，然后就是进行数据的提取了，需要处理的部分有两点需要注意的：\n",
    "\n",
    "① 提取匹配成功的字符串，并对阈值小于 90 的数据填充空值\n",
    "\n",
    "② 最后把数据添加到‘matches’字段\n",
    "\n",
    "```\n",
    "m2 = df\\_1\\['matches'\\].apply(lambda x: \\[i\\[0\\] for i in x if i\\[1\\] >= threshold\\]\\[0\\] if len(\\[i\\[0\\] for i in x if i\\[1\\] >= threshold\\]) > 0 else '')\n",
    "#要理解第一个‘matches’字段返回的数据类型是什么样子的，就不难理解这行代码了\n",
    "#参考一下这个格式： \\[('郑州市', 90), ('河南省', 0)\\]\n",
    "df\\_1\\['matches'\\] = m2\n",
    "\n",
    "return df\\_1\n",
    "\n",
    "```\n",
    "\n",
    "3.2 省份字段模糊匹配\n",
    "------------\n",
    "\n",
    "自己的数据和待匹配的数据背景介绍中已经有图片显示了，上面也已经封装了模糊匹配的函数，这里直接调用上面的函数，输入相应的参数即可，代码以及执行结果如下：  \n",
    "![](https://img-blog.csdnimg.cn/20200602122020296.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2x5c184Mjg=,size_16,color_FFFFFF,t_70)  \n",
    "数据处理完成，经过封装后的函数可以直接放在自己自定义的模块名文件下面，以后可以方便直接导入函数名即可，可以参考[将自定义常用的一些函数封装成可以直接调用的模块方法](https://blog.csdn.net/lys_828/article/details/106176229)\n",
    "\n",
    "4\\. 全部函数代码\n",
    "==========\n",
    "\n",
    "```\n",
    "#模糊匹配\n",
    "\n",
    "def fuzzy\\_merge(df\\_1, df\\_2, key1, key2, threshold=90, limit=2):\n",
    "    \"\"\"\n",
    "    :param df\\_1: the left table to join\n",
    "    :param df\\_2: the right table to join\n",
    "    :param key1: key column of the left table\n",
    "    :param key2: key column of the right table\n",
    "    :param threshold: how close the matches should be to return a match, based on Levenshtein distance\n",
    "    :param limit: the amount of matches that will get returned, these are sorted high to low\n",
    "    :return: dataframe with boths keys and matches\n",
    "    \"\"\"\n",
    "    s = df\\_2\\[key2\\].tolist()\n",
    "\n",
    "    m = df\\_1\\[key1\\].apply(lambda x: process.extract(x, s, limit=limit))    \n",
    "    df\\_1\\['matches'\\] = m\n",
    "\n",
    "    m2 = df\\_1\\['matches'\\].apply(lambda x: \\[i\\[0\\] for i in x if i\\[1\\] >= threshold\\]\\[0\\] if len(\\[i\\[0\\] for i in x if i\\[1\\] >= threshold\\]) > 0 else '')\n",
    "    df\\_1\\['matches'\\] = m2\n",
    "\n",
    "    return df\\_1\n",
    "    \n",
    "from fuzzywuzzy import fuzz\n",
    "from fuzzywuzzy import process\n",
    "\n",
    "df = fuzzy\\_merge(data, company, '公司名称', '公司名称', threshold=90)\n",
    "df\n",
    "\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
